Member

@Peiyingy Peiyingy commented Nov 18, 2025

Description

This PR adds Trino-cluster-based pinning as proposed in #789.
File-based routing rules can now specify actions like "result.put(\"routingCluster\", \"trino-1\")" to pin directly to a specific backend cluster. The rules engine evaluates all rules and selects the highest-priority match, which can either be a routing group or a routing cluster.

  • If the result is a routing group, the existing behavior applies.
  • If the result is a routing cluster, we verify that the cluster exists, is active, and is healthy. If it passes these checks, the query is routed to that cluster. If not, we fall back to the default routing group, matching the current behavior for routing groups.
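The cluster check and fallback described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the Backend record, the resolve method, and the way the default group is looked up are hypothetical names for this example, not the gateway's actual classes.

```java
import java.util.List;
import java.util.Optional;

final class ClusterPinningSketch {
    // Hypothetical stand-in for a backend entry; the real gateway type differs.
    record Backend(String name, String routingGroup, boolean active, boolean healthy) {}

    // Returns the pinned cluster when it exists, is active, and is healthy.
    // Otherwise falls back to the first usable backend in the default routing group.
    static Optional<Backend> resolve(List<Backend> backends, String routingCluster, String defaultGroup) {
        Optional<Backend> pinned = backends.stream()
                .filter(b -> b.name().equals(routingCluster))
                .filter(Backend::active)
                .filter(Backend::healthy)
                .findFirst();
        if (pinned.isPresent()) {
            return pinned;
        }
        // Assumption: the fallback applies the same active/healthy checks.
        return backends.stream()
                .filter(b -> b.routingGroup().equals(defaultGroup))
                .filter(Backend::active)
                .filter(Backend::healthy)
                .findFirst();
    }
}
```

In this sketch, pinning to an unknown or unhealthy cluster is not an error: the caller simply receives a backend from the default group, if one is usable.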

The external routing selector can also return routingCluster as the routing destination.

Logic Changes

  • File-based routing rules now support "result.put(\"routingCluster\", \"trino-1\")".
  • In FileBasedRoutingSelector.findRoutingDestination, we now ensure the result map contains exactly one entry so that only the highest-priority decision is kept.

Key Change Explanation

All routing rules are sorted by priority. In FileBasedRoutingSelector, we iterate through them from low to high priority, checking whether each rule matches. Each match overwrites the previous decision. By the end, we have the routing decision associated with the highest-priority rule.
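A minimal sketch of this last-match-wins traversal, assuming that a larger number means a higher priority. The Rule record and decide method are hypothetical and only illustrate the ordering; they are not the rules engine's actual API.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

final class PriorityRuleSketch {
    // Hypothetical rule: a condition on the user plus the decision it produces.
    record Rule(String name, int priority, Predicate<String> condition, String decision) {}

    // Evaluates rules from low to high priority; each match overwrites the previous one.
    static Optional<String> decide(List<Rule> rules, String user) {
        List<Rule> sorted = rules.stream()
                .sorted(Comparator.comparingInt(Rule::priority))
                .toList();
        String decision = null;
        for (Rule rule : sorted) {
            if (rule.condition().test(user)) {
                decision = rule.decision(); // a later, higher-priority match wins
            }
        }
        return Optional.ofNullable(decision);
    }
}
```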

Previously, the only possible result key was routingGroup. Because each match overwrites the previous one, the result HashMap ended up with a single routingGroup entry holding the highest-priority routing group.

Now that we support both routingGroup and routingCluster, a regular HashMap can end up with two entries:

  • a routingGroup key with the value of the highest-priority group
  • a routingCluster key with the value of the highest-priority cluster

At this point, we won't know which one actually has the higher priority.

To solve this, we switched the result map to a LinkedHashMap that overrides removeEldestEntry to evict the oldest entry whenever the map grows beyond one entry. As we traverse the rules, a higher-priority routing cluster therefore overrides a lower-priority routing group, and vice versa. After all rules are traversed, we end up with exactly one result, either a routing group or a routing cluster, with the globally highest priority.
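The capped map can be sketched as below. The class and method names are made up for this example; only LinkedHashMap and its removeEldestEntry hook come from the JDK.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SingleEntryMapSketch {
    // Builds an insertion-ordered map that never holds more than one entry.
    static Map<String, String> newSingleEntryMap() {
        return new LinkedHashMap<String, String>(1) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                // Called after each new insertion; evict the oldest entry once size exceeds one.
                return size() > 1;
            }
        };
    }

    public static void main(String[] args) {
        Map<String, String> result = newSingleEntryMap();
        result.put("routingGroup", "adhoc");     // lower-priority rule matched first
        result.put("routingCluster", "trino-1"); // higher-priority rule matched later
        System.out.println(result);              // only the routingCluster entry remains
    }
}
```

Because LinkedHashMap keeps insertion order by default, the eldest entry is always the decision written by the earlier, lower-priority rule, which is the one that should be dropped.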

Refactoring

  • Renamed RoutingGroupSelector to the more general RoutingSelector, with corresponding updates to all classes that implement it.
  • Renamed RoutingGroupResponse to RoutingResponse and added a routingCluster field.
  • Added a routingCluster field to RoutingSelectorResponse.
  • Replaced routingGroup in RoutingDestination with a more general variable name routingDecision, which can be either a routing group or a routing cluster. This change also applies to:
    • the query_history table column
    • cache key
    • UI display

Testing

  • mvn clean install
  • Added new test cases in TestRoutingSelector
  • Local test to verify the routing cluster and routing group pinning both work
[Screenshot 2025-11-17 at 23 47 48]

Apply local changes from detached HEAD

wip

wip

erroeous ver

wip

clean build

add comments

add tests

clean up front end code

wip

update routing setting

clean up frontend code
@cla-bot cla-bot bot added the cla-signed label Nov 18, 2025
@Peiyingy Peiyingy marked this pull request as ready for review November 18, 2025 07:49
Contributor

RoeyoOgen commented Nov 18, 2025

in PR description you say:

If the result is a routing cluster, we verify that the cluster exists, is active, and is healthy. If it passes these checks, the query is routed to that cluster. If not, we fall back to the adhoc routing group, matching the current behavior for routing group.

did you make sure to accommodate:
routing: defaultRoutingGroup: config ?

Contributor

@RoeyoOgen RoeyoOgen left a comment

also, i might have missed but i don't see any new tests just fixing existing ones

routingRules:
  rulesEngineEnabled: False
  # rulesConfigPath: "src/main/resources/rules/routing_rules.yml"
  # rulesConfigPath: "gateway-ha/src/main/resources/rules/routing_rules.yml"
Contributor

why have this change?
also why -ha?

Member Author

When I test with TrinoGatewayRunner, the working directory is set to the repository root. For instance, to read the config.yaml file, the path needs to include the gateway-ha directory:

HaGatewayLauncher.main(new String[] {"gateway-ha/config.yaml"});

Wondering what's the reason we'd like to set it to rulesConfigPath: "src/main/resources/rules/routing_rules.yml"?
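To illustrate why the relative path depends on the working directory, here is a small sketch. The /repo location and the helper name are hypothetical; the point is only that the two spellings of rulesConfigPath name the same file when resolved from different working directories.

```java
import java.nio.file.Path;

final class WorkingDirPathSketch {
    // Resolves a relative config path against a given working directory.
    static Path resolveAgainst(Path workingDir, String relative) {
        return workingDir.resolve(relative).normalize();
    }

    public static void main(String[] args) {
        // Launched from the repository root (assumed here to be /repo).
        Path fromRoot = resolveAgainst(Path.of("/repo"),
                "gateway-ha/src/main/resources/rules/routing_rules.yml");
        // Launched from inside the gateway-ha module.
        Path fromModule = resolveAgainst(Path.of("/repo/gateway-ha"),
                "src/main/resources/rules/routing_rules.yml");
        System.out.println(fromRoot.equals(fromModule));
    }
}
```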

  {
-     Map<String, String> result = new HashMap<>();
+     // Keep only the highest-priority rule result by limiting the map to a single entry.
+     LinkedHashMap<String, String> result = new LinkedHashMap<>(1) {
+         @Override
Contributor

Why is this change needed? Is this an existing issue, or is it caused by the added behaviour?

Member Author

Good question. I've added a Key Change Explanation section to the PR description to explain this change.

@Peiyingy
Member Author

in PR description you say:

If the result is a routing cluster, we verify that the cluster exists, is active, and is healthy. If it passes these checks, the query is routed to that cluster. If not, we fall back to the adhoc routing group, matching the current behavior for routing group.

did you make sure to accommodate: routing: defaultRoutingGroup: config ?

Yes. If no cluster matching the rule exists, we fall back to the defaultRoutingGroup. The logic is here:

.orElseGet(() -> provideDefaultBackendConfiguration(user));

I've also modified the PR description to make it clear.

@Peiyingy
Member Author

also, i might have missed but i don't see any new tests just fixing existing ones

New tests are added in TestRoutingSelector

@Test
void testPinByRoutingCluster() {
    RoutingSelector routingSelector =
            RoutingSelector.byRoutingRulesEngine("src/test/resources/rules/routing_rules_group_and_cluster.yml",
                    oneHourRefreshPeriod,
                    requestAnalyzerConfig);
    HttpServletRequest mockRequest = new QueryRequestMock()
            .httpHeader(TrinoQueryProperties.TRINO_CATALOG_HEADER_NAME, DEFAULT_CATALOG)
            .httpHeader(TrinoQueryProperties.TRINO_SCHEMA_HEADER_NAME, DEFAULT_SCHEMA)
            .requestAnalyzerConfig(requestAnalyzerConfig)
            .getHttpServletRequest();
    when(mockRequest.getHeader("X-Trino-User")).thenReturn("user1");
    RoutingSelectorResponse routingSelectorResponse = routingSelector.findRoutingDestination(mockRequest);
    // For user1: the matching rule pins to a cluster, so no routing group is returned.
    assertThat(routingSelectorResponse.routingGroup()).isNull();
    assertThat(routingSelectorResponse.routingCluster()).isEqualTo("cluster01");
}

@Test
void testHigherPriorityRoutingRuleWins() {
    RoutingSelector routingSelector =
            RoutingSelector.byRoutingRulesEngine("src/test/resources/rules/routing_rules_group_and_cluster.yml",
                    oneHourRefreshPeriod,
                    requestAnalyzerConfig);
    HttpServletRequest mockRequestForGroup = new QueryRequestMock()
            .httpHeader(TrinoQueryProperties.TRINO_CATALOG_HEADER_NAME, DEFAULT_CATALOG)
            .httpHeader(TrinoQueryProperties.TRINO_SCHEMA_HEADER_NAME, DEFAULT_SCHEMA)
            .requestAnalyzerConfig(requestAnalyzerConfig)
            .getHttpServletRequest();
    when(mockRequestForGroup.getHeader("X-Trino-User")).thenReturn("user2");
    RoutingSelectorResponse responseForGroup = routingSelector.findRoutingDestination(mockRequestForGroup);
    // For user2: routingCluster has higher priority than routingGroup (see routing_rules_group_and_cluster.yml).
    // The capped LinkedHashMap keeps only the last (highest-priority) entry, so the routingGroup is evicted.
    assertThat(responseForGroup.routingGroup()).isNull();
    assertThat(responseForGroup.routingCluster()).isEqualTo("adhoc01");

    HttpServletRequest mockRequestForCluster = new QueryRequestMock()
            .httpHeader(TrinoQueryProperties.TRINO_CATALOG_HEADER_NAME, DEFAULT_CATALOG)
            .httpHeader(TrinoQueryProperties.TRINO_SCHEMA_HEADER_NAME, DEFAULT_SCHEMA)
            .requestAnalyzerConfig(requestAnalyzerConfig)
            .getHttpServletRequest();
    when(mockRequestForCluster.getHeader("X-Trino-User")).thenReturn("user3");
    RoutingSelectorResponse responseForCluster = routingSelector.findRoutingDestination(mockRequestForCluster);
    // For user3: routingGroup has higher priority than routingCluster (see routing_rules_group_and_cluster.yml).
    assertThat(responseForCluster.routingGroup()).isEqualTo("adhoc");
    assertThat(responseForCluster.routingCluster()).isNull();
}
